Is there statistical evidence of fewer births on Feb 29 in the US?

An evaluation using 2000-2014 data

Leap day aversion? A 2012 calandar of births.

Is there statistical evidence of fewer births on Feb 29 in the US? The 2012 calendar below, showing number of birth as represented by color and bubble size, makes it apparent that there can be quite a lot of variation in the number of births from one day to another. For example December 25th’s recorded number of births is small compared with its neighbors, and births on the weekends appear to be much lower than their weekday counterparts.

With February 29th coming up this week, you might wonder, are there also relatively fewer births on February 29th, circled on the calendar. Feb 29th is a weird birthday to have – only truly ‘celebrated’ every four years. But it’s hard to draw any conclusions about a statistical evidence from this visualization and a single year of data. In what follows, we’ll assess the question, ‘Is there statistical evidence of fewer births on February 29th based on number of births in the U.S. over a 15 year period, 2000 to 2014?’

births_df %>%
  filter(year == 2012) %>% 
  ggcalendar() + 
  aes(date = date) + 
  geom_point_calendar() +
  aes(size = births) +
  aes(color = births) +
  geom_text_calendar(aes(label = day(date)), 
                     color = "oldlace", 
                     size = 2) + 
  guides(
    colour = guide_legend("Births"),
    size = guide_legend("Births")
 ) + 
  geom_point_calendar(data = data.frame(date =
                                      as_date("2012-02-29")),
                      size = 7, color = "red", shape = 21, stroke = 2) + 
  scale_color_viridis_c() + 
  labs(title = "The year in 2012 in births")

Zooming in on the week of February 29, 2012 in the figure below, it’s easier to appreciate the dip in number of births that is observed on February 29th. But the question remains, Is this variability usual, or is there an aversion to birthing on February 29th that is statistically detectable?

births_df |>
  filter(date >= as.Date("2012-02-26") &
           date <= as.Date("2012-03-03")) |>
  ggplot() + 
  aes(x = date, y = births) +
  geom_point() + 
  geom_line(linetype = "dashed") + 
  geom_label(aes(label = wday(date, label = T))) + 
  geom_point(data = . %>% filter(date == as.Date("2012-02-29")),
             color = "red", shape = 1, size = 20, stroke = 2, alpha = .8) +
  labs(title = "Births the week of Feb 29, 2012")

I begin the analysis with a visual format that’s familiar to the audience, and where we can be pretty sure of the effects of relevant variables just by inspection

2000-2014 U.S. Births Data

The 15 years of data we’ll look at was available via the #TidyTuesday project – their October 2, 2018 dataset. Some data cleaning is required including constructing a full date variable from year, month and date_of_month variables. I also normalize the dates (to 2020) for superimposed within-year comparisons - importantly 2020 is a leap year so Feb 29th dates are valid and not dropped. I further include indicator variables that I anticipate to be important as well as an indicator variable that captures the condition of interest, ind_Feb_29th.

library(tidyverse)

births_path <- "https://raw.githubusercontent.com/EvaMaeRey/tableau/9e91c2b5ee803bfef10d35646cf4ce6675b92b55/tidytuesday_data/2018-10-02-us_births_2000-2014.csv"

library(ggcalendar)

readr::read_csv(births_path) %>% 
  filter(year != 2015) %>%
  mutate(month = str_pad(month, 2, pad = "0"),
         date_of_month = str_pad(date_of_month, 2, pad = "0")) %>% 
  mutate(date = paste(year, month, date_of_month, sep = "-") %>% as_date()) %>% 
  mutate(ind_holiday = 
           (month == "12" & date_of_month %in% 24:31) |
           (month == "07" & date_of_month %in% c("04", "05")) |
           (month == "01" & date_of_month %in% c("01", "02")) | 
           (month == "10" & date_of_month == "31") | 
           (month == "11" & date_of_month %in% 20:30) |
           (month == "02" & date_of_month %in% 14)
           ) |>
  mutate(date_in_2020 = paste(2020, month, date_of_month, sep = "-") %>% as_date()) |>
  mutate(ind_weekend = wday(date) == 1 | wday(date) == 7) |>
  mutate(ind_Feb_29th = month(date) == 2 & day(date) == 29) |>
  mutate(ind_13th = day(date) == 13) |>
  mutate(ind_Fri13th = wday(date) == 6 & day(date) == 13) ->
births_df

Visual exploration

Baseline visualization of the time series

Visualizing the U.S. births time series, we see that there is tremendous variability in number of birth per day. The standard deviation is 2326 in this time span, with the average number of births around 11400, as marked in the visualization below.

births_df |> 
  ggplot() + 
  aes(x = date, y = births) + 
  geom_point(size = .2) ->
time_series_base


add_y_mean <- function(){
  list(ggxmean::geom_y_mean(linetype = "dotted"),
  ggxmean::geom_y_mean_label() )
}

time_series_base + 
  add_y_mean()



round(sd(births_df$births))
#> [1] 2326

Univariate distribution

Looking at the univariate distribution of births per day, the bi-modal pattern is striking.

# ggchalkboard:::geoms_chalk_on()

ggplot(births_df) + 
  aes(x = births) + 
  geom_histogram() + 
  ggxmean::geom_x_mean() +
  ggxmean::geom_x_mean_label() + 
  geom_rug(alpha = .2) + 
  labs(x = "Number of births") + 
  # ggchalkboard::theme_chalkboard() + 
  labs(title = "Distribution of Number of Births in the U.S. each day from 2000-2014" %>% str_wrap(45)) ->
univariate_base; univariate_base

Exploring bi-modality with weekend indicator

Breaking this data up, we explore the hypothesis that preference for scheduled birth (inducements or c-sections) is for weekdays. In the figure below see that most of the bimodality is explained by weekend v. weekday effects. The difference in means for these groups is substantial - around 4650 difference in number of births!

univariate_base +
  facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 1) 

Day of weeks effects

When we return to our time-series plot, but breaking the plots up by weekend, you may notice further bimodality within the ‘weekend’ subplot suggesting differences within the weekend for Saturday and Sunday.

time_series_base + 
  facet_wrap(~ind2cat::ind_recode(ind_weekend), ncol = 1) + 
  labs(title = "We explore the bimodal distribution looking at 'weekend effects'" %>% str_wrap(45))

Following on that observation, we investigate by-weekday (Sunday, Monday, Tuesday etc) differences in number of births below. We look first at the histogram and then the time series plot.

univariate_base +
  facet_wrap(~wday(date, label = T, week_start = "Monday"), ncol = 1) 

We add a confidence interval around the mean based on a t-test.

last_plot() + 
  ggxmean:::geom_tdist(color = "red")

time_series_base + 
  facet_wrap(~wday(date, label = T, 
                   week_start = "Monday"), ncol = 2) + 
  labs(title = "Number of births 2000-2014 by day of week" %>% str_wrap(45))

Holidays and outlying observations

The by-day-of-the-week exploration (and weekend vs. weekday) further exposes some outlying observations. We next explore if these lower-than-usual their counterparts are by and large holiday or holiday adjacent days. Due to time constraints, I’m relying a quick indicator variable that I created that include many holiday and adjacent dates.

last_plot() -> tplot
tplot[[2]] <- NULL

tplot + 
  geom_point(alpha = .5, size = .5) + 
  aes(color = ind2cat::ind_recode(ind_holiday)) +
  labs(color = NULL) + 
  theme(legend.position = 'top', legend.justification = "left")

This exploration suggests that holidays drive much of the out-of-character observations.

Outlyingness might also be due to aversion due to superstition - for example the 13th of each month, especially Fridays the 13th, and Halloween might be more outlying though not holidays.

Also, holiday adjacency might also lead to lower number of births - for example federal holidays that fall on the weekend are often observed on a Monday. The outliers that we observe on Monday are not categorized as holiday, but this likely due to the imperfect ind_holiday coding. President’s day was not captured in my coding, for example and always falls on a Monday.

Because February 29 is not a holiday or holiday-adjacent, it is appropriate to prune our data to relevant comparisons and try to remove likely holidays; this should lead to more precise estimates in our final analysis.

Due to time constraints and convenience, our pruning will be statistical instead of using substantive knowledge (i.e. using lists of federal and celebrated holidays, etc).

The approach is to prune observations for which a simple linear model’s error is greater than 3 standard deviations from the mean residual error (zero).

The linear model contain date and day of the week as categories.

To start, I visualize the parallel lines model fit.

compute_panel_lm_parallel <- function(data, scales){
  
  model <- lm(y ~ x + category, data = data)
  
  data |>
    mutate(y = model$fitted)
  
}

StatParallel <- ggplot2::ggproto(`_class` = "StatParallel",
                           `_inherit` = ggplot2::Stat,
                           required_aes = c("x", "y", "category"),
                           compute_panel = compute_panel_lm_parallel,
                           default_aes = aes(color = after_stat(category)))

geom_ols_linear_parallel <- function(mapping = NULL, data = NULL,
                           position = "identity", na.rm = FALSE,
                           show.legend = NA,
                           inherit.aes = TRUE, ...) {
  ggplot2::layer(
    stat = StatParallel, # proto object from Step 2
    geom = ggplot2::GeomLine, # inherit other behavior
    data = data,
    mapping = mapping,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

ggplot(births_df) +
  aes(x = date, y = births,
      color = wday(date, label = T), category = wday(date, label = T)) +
  geom_point(alpha = .15) + 
  geom_ols_linear_parallel()

Now, we estimate model outside of the visualization tool, and collect the residuals. Then we visualize the distribution of the residuals and mark the bound of 3 standard deviations from the mean, beyond which we’ll prune our dataset.

m <- lm(births ~ date + wday(date, label = T), data = births_df)
births_df$residuals_wday <- m$residuals

ggplot(births_df) +
  aes(x = residuals_wday) +
  geom_histogram() + 
  ggxmean::geom_x_mean() + 
  ggxmean:::geom_x3sd(linetype = "dashed")


sd(births_df$residuals_wday)
#> [1] 895.3903

Dropping the large residuals results in 116 days dropped from our dataset.

births_df_pruned <- births_df |>
  filter(abs(residuals_wday) < 
           3*sd(births_df$residuals_wday))

time_series_base %+% births_df_pruned +
  facet_wrap(~wday(date, label = T), ncol = 2) + 
  labs(title = "Outlier pruned" %>% str_wrap(45)) 


dim(births_df) - dim(births_df_pruned)
#> [1] 116   0

In the remainder of the exploration and analysis, we’ll use the pruned version of the data, unless otherwise specified.

Modeling…

births_df_pruned
#> # A tibble: 5,363 × 13
#>     year month date_of_month day_of_week births date       ind_holiday
#>    <dbl> <chr> <chr>               <dbl>  <dbl> <date>     <lgl>      
#>  1  2000 01    01                      6   9083 2000-01-01 TRUE       
#>  2  2000 01    02                      7   8006 2000-01-02 TRUE       
#>  3  2000 01    03                      1  11363 2000-01-03 FALSE      
#>  4  2000 01    04                      2  13032 2000-01-04 FALSE      
#>  5  2000 01    05                      3  12558 2000-01-05 FALSE      
#>  6  2000 01    06                      4  12466 2000-01-06 FALSE      
#>  7  2000 01    07                      5  12516 2000-01-07 FALSE      
#>  8  2000 01    08                      6   8934 2000-01-08 FALSE      
#>  9  2000 01    09                      7   7949 2000-01-09 FALSE      
#> 10  2000 01    10                      1  11668 2000-01-10 FALSE      
#> # ℹ 5,353 more rows
#> # ℹ 6 more variables: date_in_2020 <date>, ind_weekend <lgl>,
#> #   ind_Feb_29th <lgl>, ind_13th <lgl>, ind_Fri13th <lgl>, residuals_wday <dbl>
births_base_model <- lm(formula = 
                          births ~ factor(year) + month + factor(day_of_week), 
                        data = births_df_pruned) 

ggplot(fortify(births_base_model)) + 
  aes(x = .fitted, y = .resid) + 
  geom_point(alpha = .2) + 
  geom_hline(yintercept = 0, color = "darkred") + 
  geom_smooth()


births_base_model_feb29 <- lm(formula = 
                          births ~ factor(year) + month + factor(day_of_week) + ind_Feb_29th, 
                        data = births_df_pruned) 

summary(births_base_model_feb29)
#> 
#> Call:
#> lm(formula = births ~ factor(year) + month + factor(day_of_week) + 
#>     ind_Feb_29th, data = births_df_pruned)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2662.27  -224.20     9.72   227.49  2092.26 
#> 
#> Coefficients:
#>                      Estimate Std. Error  t value Pr(>|t|)    
#> (Intercept)          11718.19      31.98  366.443  < 2e-16 ***
#> factor(year)2001       -77.32      30.95   -2.499 0.012498 *  
#> factor(year)2002      -111.48      30.95   -3.602 0.000318 ***
#> factor(year)2003        65.25      30.95    2.109 0.035032 *  
#> factor(year)2004        83.22      30.90    2.693 0.007097 ** 
#> factor(year)2005       187.62      30.90    6.071 1.36e-09 ***
#> factor(year)2006       546.70      30.92   17.678  < 2e-16 ***
#> factor(year)2007       666.50      30.99   21.507  < 2e-16 ***
#> factor(year)2008       441.51      30.94   14.269  < 2e-16 ***
#> factor(year)2009       132.70      30.95    4.288 1.83e-05 ***
#> factor(year)2010      -233.42      30.95   -7.543 5.38e-14 ***
#> factor(year)2011      -377.26      30.90  -12.208  < 2e-16 ***
#> factor(year)2012      -397.76      30.92  -12.865  < 2e-16 ***
#> factor(year)2013      -457.70      30.95  -14.790  < 2e-16 ***
#> factor(year)2014      -359.70      30.95  -11.623  < 2e-16 ***
#> month02                137.82      28.04    4.915 9.15e-07 ***
#> month03                120.59      27.33    4.413 1.04e-05 ***
#> month04                 16.46      27.55    0.598 0.550196    
#> month05                271.60      27.56    9.857  < 2e-16 ***
#> month06                476.20      27.55   17.284  < 2e-16 ***
#> month07                831.77      27.54   30.206  < 2e-16 ***
#> month08                885.85      27.34   32.398  < 2e-16 ***
#> month09               1155.37      27.81   41.548  < 2e-16 ***
#> month10                371.83      27.33   13.606  < 2e-16 ***
#> month11                422.01      28.05   15.047  < 2e-16 ***
#> month12                442.50      27.76   15.942  < 2e-16 ***
#> factor(day_of_week)2  1039.03      21.32   48.737  < 2e-16 ***
#> factor(day_of_week)3   811.81      21.32   38.080  < 2e-16 ***
#> factor(day_of_week)4   849.50      21.42   39.663  < 2e-16 ***
#> factor(day_of_week)5   548.36      21.43   25.594  < 2e-16 ***
#> factor(day_of_week)6 -3591.86      21.24 -169.077  < 2e-16 ***
#> factor(day_of_week)7 -4635.93      21.25 -218.211  < 2e-16 ***
#> ind_Feb_29thTRUE      -861.31     208.23   -4.136 3.58e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 414 on 5330 degrees of freedom
#> Multiple R-squared:  0.9677, Adjusted R-squared:  0.9675 
#> F-statistic:  4991 on 32 and 5330 DF,  p-value: < 2.2e-16

Narrow to calendar peers

# ggchalkboard:::geoms_chalk_off()

births_df |> 
     filter(date_in_2020 >= as.Date("2020-02-20")) |>
     filter(date_in_2020 <= as.Date("2020-03-13")) |>
  ggplot() + 
  aes(x = date_in_2020, y  = births) + 
  geom_line( alpha= .2) +
  geom_point(aes(shape = ind2cat::ind_recode(ind_weekend), 
                 fill = ind2cat::ind_recode(ind_Feb_29th),
                 size = ind2cat::ind_recode(ind_Feb_29th)),
             shape = 21) +
  geom_text(aes(label = wday(date, label = T)), 
            vjust = -0.52, size = 3) + 
  # geom_text(aes(label = births), 
  #           vjust = 1.2) + 
  facet_wrap(~year) +
  scale_size_discrete(range = c(4,6))


last_plot() +
  aes(linetype = ind2cat::ind_recode(ind_weekend))



last_plot() + 
  facet_null() + 
  aes(group = paste(year, ind_weekend, isoweek(date))) + 
  aes(color = year)

births_model_feb29_peers <- lm(formula = 
                          births ~ factor(year) + date_in_2020 + factor(day_of_week) + ind_Feb_29th, 
                        data = births_df_pruned |> 
     filter(date_in_2020 >= as.Date("2020-02-16")) |>
     filter(date_in_2020 <= as.Date("2020-03-9"))) 

summary(births_model_feb29_peers)
#> 
#> Call:
#> lm(formula = births ~ factor(year) + date_in_2020 + factor(day_of_week) + 
#>     ind_Feb_29th, data = filter(filter(births_df_pruned, date_in_2020 >= 
#>     as.Date("2020-02-16")), date_in_2020 <= as.Date("2020-03-9")))
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -997.5 -171.2   -9.9  178.3 1042.1 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -21789.586  42610.810  -0.511    0.609    
#> factor(year)2001        -99.521     85.558  -1.163    0.246    
#> factor(year)2002        -48.996     85.548  -0.573    0.567    
#> factor(year)2003        113.185     85.558   1.323    0.187    
#> factor(year)2004        138.110     84.382   1.637    0.103    
#> factor(year)2005         96.470     85.519   1.128    0.260    
#> factor(year)2006        477.589     85.508   5.585 5.09e-08 ***
#> factor(year)2007        734.570     85.558   8.586 4.40e-16 ***
#> factor(year)2008        559.862     84.381   6.635 1.44e-10 ***
#> factor(year)2009        384.811     85.548   4.498 9.69e-06 ***
#> factor(year)2010       -128.954     85.559  -1.507    0.133    
#> factor(year)2011       -396.802     85.519  -4.640 5.14e-06 ***
#> factor(year)2012       -495.252     84.343  -5.872 1.11e-08 ***
#> factor(year)2013       -566.769     85.548  -6.625 1.53e-10 ***
#> factor(year)2014       -405.133     85.558  -4.735 3.33e-06 ***
#> date_in_2020              1.823      2.326   0.784    0.434    
#> factor(day_of_week)2   1112.832     59.098  18.830  < 2e-16 ***
#> factor(day_of_week)3    988.488     58.821  16.805  < 2e-16 ***
#> factor(day_of_week)4   1034.869     58.743  17.617  < 2e-16 ***
#> factor(day_of_week)5    841.286     58.827  14.301  < 2e-16 ***
#> factor(day_of_week)6  -3204.244     58.736 -54.554  < 2e-16 ***
#> factor(day_of_week)7  -4258.234     58.818 -72.397  < 2e-16 ***
#> ind_Feb_29thTRUE       -867.752    146.946  -5.905 9.22e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 285.9 on 311 degrees of freedom
#> Multiple R-squared:  0.9835, Adjusted R-squared:  0.9824 
#> F-statistic:   844 on 22 and 311 DF,  p-value: < 2.2e-16